Build a recursive word finding algorithm with Python - Part 4

2018-05-23 Programming

by Dan Duda

Requirements

Python 3.x
PyCharm Community (optional)
A text file of English words
Basic Python knowledge

Introduction

Is there a way we could store the list of words so that it’s more efficient to check if a string of letters is a valid word? The answer of course is yes. One way that I’m particularly fond of is to create a tree made up of nodes representing letters. The tree would start with all 26 letters and the child nodes of each letter node would form a path to valid words. Let’s look at a diagram to help explain it.

Here is a partial sample of what our tree might look like. From this we can have many optimizations. For example, say we’re checking the word part, ‘cb’. If we traverse our sample tree we first go to the ‘c’ node. We then check if the ‘c’ node has a child node of ‘b’. It does not so we can stop processing right there. If we had a bunch of remaining letters our current algorithm would keep on adding every combination of the letters onto ‘cb’ and checking against our word list. If we know ‘cb’ is not a valid path, then we know there can be no words starting with ‘cb’ so we can quit processing early.

Now since multiple words can start with the same combination of letters like “cat” and “cater” we need to know when a path is a valid word. To do this we can just add a flag to our nodes indicating if it’s a valid word. So, the first ‘c’ node under the root node would be set to false because there are words that start with ‘c’ but would not be a valid word itself. Same with the ‘a’ node under the ‘c’ node’. But when we get to the ‘b’ node we’d signify that it’s a valid word. The same with the ‘T’ node which would give us “cat”.

So how do we do this in code? First let’s create a new python file named “word_finder_tree.py”. In this file we’re going to create two classes. We’ll have our “WordFinderTree” class as well as a helper class named “Node”.

class Node:
    def __init__(self, letter, is_word):
        self.letter = letter
        self.is_word = is_word
        self.children = {}


class WordFinderTree:
    def __init__(self, word_file_path):
        self.root_node = Node(' ', False)

        f = open(word_file_path, "r")
        for line in f:
            word = line.rstrip()

            node = self.root_node
            for c in word:
                if c not in node.children:
                    node.children[c] = Node(c, False)

                node = node.children[c]

            node.is_word = True

        f.close()

    def is_word(self, word):
        node = self.root_node
        for c in word:
            if c in node.children:
                node = node.children[c]
            else:
                return False

        return node.is_word

    def is_start_of_word(self, word):
        node = self.root_node
        for c in word:
            if c in node.children:
                node = node.children[c]
            else:
                return False

        return True

    def find_words(self, letters):
        found = []
        self.search_words('', letters, found)
        found.sort()
        return found

    def search_words(self, word_part, remaining_letters, found_words):
        if not self.is_start_of_word(word_part):
            return

        if len(word_part) >= 3 and self.is_word(word_part):
            if word_part not in found_words:
                found_words.append(word_part)

        for i in range(len(remaining_letters)):
            remaining_letters_copy = remaining_letters.copy()
            next_letter = remaining_letters_copy[i]
            del remaining_letters_copy[i]
            self.search_words(word_part + next_letter, remaining_letters_copy, found_words)

The Node class just has a constructor that takes a letter as a parameter and a boolean signifying if it’s a word. The constructor also creates an empty dictionary called “children” which will hold any child nodes. This dictionary creates the pointers to the child nodes so that every node can be visited by starting with the root node.

Just like our “WordFinderSimple” simple class, the WordFinderTree constructor takes a path parameter to read the dictionary file. But unlike our simple version that just reads the file into a words list, we are going to create our letter tree from the file in this constructor.

We first create an empty “root node” so we have something to attach all our letter nodes to. We then read the file and loop over each word. We create a variable called node and set it equal to the root node. We’ll use this to keep track of where we are in the tree. For each word we loop through the individual letters of the word. We check if the “children” dictionary contains the letter and if it doesn’t we add it. We set node to be this new node. If the node is already a child, we set our node variable to that node. Once we reach the last letter in the word we set the current node that we’re on to be a valid word by setting the “is_word” property to “True”.

We continue this process with each word, building up the tree as we go. Notice that we only create new nodes when the path doesn’t already exist. So, if we just added the word “cab” and the next word is “cad”, the ‘c’ node and its child node ‘a’ already exist in the tree and we’ll just add the ‘d’ child node.

The constructor models the word list as a tree. We have our “find_words” function just like in our simple class which is the function we’ll call from our main program. We also have a recursive “search_words” function. We also add two additional helper functions to check if a word part is a valid word.

The rest of the changes are in the recursive “search_words” function. The first thing we do now is check if the passed in word_part is the start of a word. If it is not then we are done with the current path. We then use the “is_word” function to test if it’s a valid word and the rest is the same.

Back in our main program let’s import this new class and utilize it. I’ve commented out the simple version since it is too slow. Let’s focus on comparing the binary search version and the tree version.

import time
from word_finder_simple import WordFinderSimple
from word_finder_bsearch import WordFinderBSearch
from word_finder_tree import WordFinderTree


def display_words(words):
    print(f'{len(words)} words found')
    for w in words:
        print(w)


wf_simple = WordFinderSimple('words_alpha.txt')
wf_bsearch = WordFinderBSearch('words_alpha.txt')
wf_tree = WordFinderTree('words_alpha.txt')

s = input('Enter a string: ')
while s != '':
    letters = list(s.lower())

    #start = time.time()
    #words = wf_simple.find_words(letters)
    #end = time.time()
    #print(f'{"Time Simple:":13} {end - start} seconds')

    start = time.time()
    words = wf_bsearch.find_words(letters)
    end = time.time()
    print(f'{"Time BSearch:":13} {end - start} seconds')

    start = time.time()
    words = wf_tree.find_words(letters)
    end = time.time()
    print(f'{"Time Tree:":13} {end - start} seconds')

    # display_words(words)

    s = input('Enter a string: ')

And the winner is?

Using 8 letters this time with “agertsbw” I get:

1
2
3

Enter a string: agertsbw
Time BSearch: 0.3061807155609131 seconds
Time Tree:    0.0049865245819091 seconds

Adding just 2 more letters really shows how the tree version shines.

1
2
3

Enter a string: agertsbwon
Time BSearch: 66.0638256072998 seconds
Time Tree:     0.0269279479980 seconds

As we can see here the tree version is the best algorithm so far. Let’s modify our main program to use the tree version exclusively and remove our timing code.

from word_finder_tree import WordFinderTree


def display_words(words):
    print(f'{len(words)} words found')
    for w in words:
        print(w)


wf_tree = WordFinderTree('words_alpha.txt')

s = input('Enter a string: ')
while s != '':
    letters = list(s.lower())
    words = wf_tree.find_words(letters)
    display_words(words)

    s = input('Enter a string: ')

We now have a useful program for finding all words from a string of letters.

In the next tutorial I’ll show you how we can add some functions to our WordFinderTree to solve Boggle boards.

Stay Tuned…

#Python #Recursion #Tree #Data Structures